Modular Architectures and Statistical Mechanisms

Authors

  • Matthew W. Crocker
  • Steffan Corley
Abstract

This paper reviews the modular, statistical model of human lexical category disambiguation (SLCM) proposed by Corley and Crocker (2000). The SLCM is a distinct lexical category disambiguation mechanism within the human sentence processor, which uses word-category frequencies and category bigram frequencies for the initial resolution of category (part-of-speech) ambiguities. The model has been shown to account for a range of existing experimental findings in relatively diverse constructions. This paper presents the results of two new experiments that directly confirm the predictions of the model. The first experiment demonstrates the dominant role of word-category frequency in resolving noun-verb ambiguities. The second experiment then presents evidence for the modularity of the mechanism, by demonstrating that immediately available syntactic context does not override the SLCM's initial decision.

This paper presents entirely joint work, and the order of authors is arbitrary. Correspondence should be sent in the first instance to M. Crocker ([email protected]). The authors would like to express particular thanks to Charles Clifton, Jr. and Martin Corley for their invaluable assistance. The authors gratefully acknowledge the support of an ESRC Research Fellowship (to Matthew Crocker, #H5242700394) and an ESRC Studentship (to Steffan Corley, #R00429334081), both of which were held at the Centre for Cognitive Science, University of Edinburgh.

Introduction

This paper reconsiders the nature of modular architectures in the light of recent empirical, theoretical and computational developments concerning the exploitation of statistical language processing mechanisms. We defend a simpler notion of modularity than that proposed by Fodor (1983). Given current conflicting theoretical arguments and empirical evidence for and against modularity, we argue for modularity strictly on computational and methodological grounds. We then apply this to a particular aspect of human language processing: the problem of lexical category disambiguation. While previous work has often focused on the kinds of linguistic knowledge which are used in ambiguity resolution, we focus on the role of statistical, or frequency-based, knowledge. While such mechanisms are now a common element of non-modular, constraint-based models (see Tanenhaus et al (in press)), we argue that probabilistic mechanisms may be naturally associated with modular architectures. In particular, we suggest that a Statistical Lexical Category Module (SLCM) provides an extremely efficient and accurate solution to the sub-problem of lexical category disambiguation. Following a summary of the model and how it accounts for the range of relevant existing data, we review the results of two new experiments that test the predictions of both the statistical and modular aspects of the SLCM, and provide further support for our proposals.

Modularity, Constraints and Statistics

The issue of modularity continues to be a hotly debated topic within the sentence processing literature. Parser-based models of human sentence processing led to the tacit emergence of syntactic modularity, which was then rationally defended by Fodor (1983). In particular, Fodor argued that cognitive faculties are divided into input processes, which are modular, and central processes, which are not.
The divide between input and central processes is roughly coextensive with the divide between perception and cognition; in the case of language, Fodor located this divide between the subject matter of formal linguistics and that of pragmatics and discourse analysis. Recently, there has been a shift in consensus towards more interactionist, non-modular positions (see Crocker (1999) for a more complete introduction to the issues presented in this section). The term ‘constraint-based’ is often used to denote such an interactionist position. The constraint-based position is tacitly assumed to imply that all constraints can in principle apply immediately and simultaneously, across all levels of linguistic representation, and possibly even across perceptual faculties (Tanenhaus et al, 1995).

Modular and interactive positions are often associated with other computational properties. Spivey-Knowlton and Eberhard (1996) argue that modular positions tend to be symbolic, binary, unidirectional and serial. In contrast, interactive models tend to be distributed, probabilistic, bi-directional and parallel. Further, Spivey-Knowlton and Eberhard suggest that “when a model is specified in enough detail to be associated with a region in this space, that region’s projection onto the continuum of modularity indicates the degree to which a model is modular” (pp. 39–40, their italics).

Spivey-Knowlton and Eberhard’s position turns a historical accident into a definition. While existing models do pattern approximately along the lines they propose, we suggest that their characterisation inaccurately represents the underlying notion of modularity. Of course, their characterisation does define a particular computational position which one might dub ‘modular’, but the falsification of that position crucially does not falsify the general notion of modularity, only the particular position they define. We propose a simplified definition of modularity that is independent of any commitment to orthogonal issues such as the symbolic-distributed, binary-probabilistic, unidirectional-bidirectional and serial-parallel nature of a particular theory. Rather, our definition focuses purely on information-flow characteristics:

  • A module can only process information stated in its own representational and informational vocabulary. For example, the syntactic processor can only make use of grammatical information.
  • A module is independently predictive. That is, we do not need to know about any other component of the cognitive architecture to make predictions about the behaviour of a module (provided we know the module’s input).
  • A module has low bandwidth in both feedforward and feedback connections. By this we mean that it passes a comparatively small amount of information (compared to its internal bandwidth) on to subsequent and prior modules.

These three defining properties of a modular architecture overlap. If one module cannot understand the representational vocabulary of another, then information about its internal decision process is of no use; thus the cost of passing such information on would not be warranted. Similarly, a module cannot be independently predictive if its decisions depend on representations constructed by other modules that are not part of its input — independent prediction is therefore directly tied to low bandwidth feedback connections.
In sum, we propose a simple definition of modularity in which modules process a specific representation and satisfy the relevant constraints which are defined for that level of representation. Modules have high internal bandwidth and are connected to each other by relatively low bandwidth connections: the lower the bandwidth, the greater the modularity. This definition is independent of whether we choose to state our modules in more distributed or symbolic terms, as it should be.

Statistical Mechanisms

In the previous section, we noted Spivey-Knowlton and Eberhard’s (1996) claim that modularity is normally associated with binary rather than probabilistic decision procedures. This claim derives largely from the association of constraint-based architectures with connectionist implementations (Tanenhaus et al, in press; MacDonald et al, 1994), which in turn have a natural tendency to exhibit frequency effects. We proposed a definition of modularity which is consistent with statistical mechanisms. In this section, we argue that modularity and statistical mechanisms are in fact natural collaborators.

The motivation for modularity is essentially one of computational compromise, based on the assumption that an unrestricted constraint-satisfaction procedure could neither operate in real time (Fodor, 1983) nor acquire such a heterogeneous system of constraints in the first place (Norris, 1990). It is still reasonable to assume, however, that modules will converge on highly effective processing mechanisms; that is, a mechanism which can accurately and rapidly arrive at the correct analysis of the input, based on the restricted knowledge available within the module. For purposes of disambiguation, the module should therefore use the best heuristics it can, again modulo any computational and informational limitations. In the spirit of rational analysis (Anderson, 1991), one might therefore choose to reason about such a mechanism as an optimal process in probabilistic terms. This approach has been exploited both in the study of human sentence processing (Chater et al, 1999; Jurafsky, 1996) and in computational linguistics, where statistical language models have been effectively applied to problems of speech recognition, part-of-speech tagging, and parsing (see Charniak (1993; 1997) for an overview).

We propose a specific hypothesis, in which modules may make use of statistical mechanisms in their desire to perform as effectively as possible in the face of restricted knowledge. We define statistical modularity by introducing the ‘Modular Statistical Hypothesis’ (MSH):

The Modular Statistical Hypothesis: The human sentence processor is composed of a number of modules, at least some of which use statistical mechanisms. Statistical results may be communicated between modules, but statistical processes are restricted to operating within, and not across, modules.

This hypothesis encompasses a range of possible models, including the coarse-grained architecture espoused by proponents of the Tuning Hypothesis (Mitchell et al, 1995; Mitchell & Brysbaert, to appear). However, it excludes interactive models such as those proposed by MacDonald et al. (1994), Tanenhaus et al (in press) and Jurafsky (1996) – despite their probabilistic nature – since the models that fall within the MSH are necessarily a subset of those that are modular.
In the case of a statistical module, we assume that heuristic decision strategies are based on statistical knowledge accrued by the module, presumably on the basis of linguistic experience. Assuming that the module collates statistics itself, it must have access to some measure of the ‘correctness’ of its decisions; this could be informed by whether or not reanalysis was requested by later processes. The most restrictive modular statistical model is therefore one in which modules are fully encapsulated and only offer a single analysis to higher levels of processing. The statistical measures such a module depends on are thus architecturally limited. Such measures cannot directly reflect information pertaining to higher levels of processing, as these are not available to the module. Assuming very low bandwidth feedforward connections, or shallow output, it is also impossible for the module to collate statistics concerning levels of representation that are the province of modules that precede it. A modular architecture therefore constrains the representations for which statistics may be accrued, and subsequently used to inform decision making processes; this contrasts with an interactive architecture, where there are no such constraints on the decision process.

It is worth noting that we have argued for the use of statistical mechanisms in modular architectures on primarily rational grounds. That is, such statistical mechanisms have been demonstrated to provide highly effective heuristic decisions in the absence of full knowledge, and their use is therefore highly strategic, not accidental. Indeed, it might even be argued that such mechanisms give good approximations of ‘higher-level’ knowledge. For example, simple word bigrams will model those words that co-occur frequently or infrequently. Since highly semantically plausible collocations are likely to be more frequent than less plausible ones, such statistics can appear to be modelling semantic knowledge, as well as just the distribution of word types.

In contrast, constraint-based, interactionist models motivate the existence of frequency effects as an essentially unavoidable consequence of the underlying connectionist architecture (see Seidenberg (1997) for general discussion), along with other factors such as neighbourhood effects. Interestingly, this may lead to some rather strong predictions. Since such mechanisms are highly sensitive to frequency, they would seem to preclude probabilistic mechanisms that do not select a “most likely” analysis based on these prior frequencies. Pickering et al (2000), however, present evidence against likelihood-based accounts, and propose an alternative probabilistic model based on a rational analysis of the parsing problem (Chater et al, 1999).

Lexical Category Ambiguity

The debate concerning the architecture of the human language processor has typically focused on the syntax-semantics divide. Here, however, we consider the problem of lexical category ambiguity, and argue for the plausibility of a distinct lexical category disambiguation module. Lexical category ambiguity occurs when a word can be assigned more than one part of speech (noun, verb, adjective etc.). Consider, for example, the following sentence:

(1) He saw her duck.

There are two obvious, plausible readings for sentence 1. In one reading, ‘her’ is a possessive pronoun and ‘duck’ is a noun (cf. 2a); in the other reading, ‘her’ is a personal pronoun and ‘duck’ is a verb (cf. 2b).

(2) a) He saw her[POSS] apple.
    b) He saw her[PRON] leave.
Lexical Category Ambiguity and Lexical Access

Lexical access is the stage of processing at which lexical entries for input words are retrieved. Evidence suggests that multiple meanings for a given word are activated even when semantic context biases in favour of a single meaning (Swinney, 1979; Seidenberg et al., 1982; but see Kawamoto (1993) for more thorough discussion). The evidence does not, however, support the determination of grammatical class during lexical access. Tanenhaus, Leiman and Seidenberg (1979) found that when subjects heard sentences such as those in (3), containing a locally ambiguous word in an unambiguous syntactic context, they were able to name a target word which was semantically related to either of the possible meanings of the ambiguous word (e.g. SLEEP or WHEEL) faster than they were able to name an unrelated target.

(3) a) John began to tire.
    b) John lost the tire.

This suggests that words related to both meanings had been primed; both meanings must therefore have been accessed, despite the fact that only one was compatible with the syntactic context. Seidenberg, Tanenhaus, Leiman and Bienkowski (1982) replicated these results, and Tanenhaus and Donnenworth-Nolan (1984) demonstrated that they could not be attributed to the ambiguity (when spoken) of the word ‘to’, or to subjects’ inability to integrate syntactic information fast enough prior to hearing the ambiguous word.

Such evidence is consistent with a model in which lexical category disambiguation occurs after lexical access. The tacit assumption in much of the sentence processing literature has been that grammatical classes are determined during parsing (see Frazier (1978) and Pritchett (1992) as examples). If grammar terminals are words rather than lexical categories, then such a model requires no augmentation of the parsing mechanism. Alternatively, Frazier and Rayner (1987) proposed that lexical category disambiguation has a privileged status within the parser; different mechanisms are used to arbitrate such ambiguities from those concerned with structure building. Finally, lexical categories may be determined after lexical access, but prior to syntactic analysis. That is, lexical category disambiguation may constitute a module in its own right.

The Privileged Status of Lexical Category Ambiguity

There are essentially three possible positions regarding the relationship between syntax and lexical category.

1. Lexical categories are syntactic: The terminals in the grammar are words and it is the job of the syntactic processes to determine the lexical category that dominates each word (Frazier, 1978; Pritchett, 1992).
2. Syntactic structures are in the lexicon: The bulk of linguistic competence is in the lexicon, including rich representations of the trees projected by lexical items. Parsing is reduced to connecting trees together (MacDonald et al, 1994; Kim and Trueswell, this volume).
3. Syntax and lexical category determination are distinct: Syntax and the lexicon have their own processes responsible for initial structure building and ambiguity resolution.

If we take the latter view of lexical category ambiguities, one possibility is that a pre-syntactic modular process makes lexical category decisions. These decisions would have to be made on the basis of a simple heuristic, without the benefit of syntactic constraints.
In common with all modules, such a process will make incorrect decisions when potentially available information (such as syntactic constraints) could have permitted a correct decision. It does, however, offer an extremely low cost alternative to arbitration by syntactic and other knowledge. That is, disambiguation on the basis of full knowledge potentially entails the integration of constraints of various types, across various levels of representation. It may be the case that such processes cannot converge rapidly enough on the correct disambiguated form.

For this argument to be compelling, it must also be the case that lexical category ambiguities are frequent enough to warrant a distinct resolution process. This can be verified by determining the number of words that occur with more than one category in a large text corpus. DeRose (1988) has produced such an estimate from the Brown corpus; he found that 11.5% of word types and 40% of tokens occur with more than one lexical category. As the mean length of the sentences in the Brown corpus is 19.4 words, DeRose’s figures suggest that there are 7.75 categorially ambiguous words in an average corpus sentence. Our own investigations suggest the extent of the problem is even greater. Using the TreeBank version of the Brown corpus, we discovered 10.9% ambiguity by type, and a staggering 65.8% by token. To obtain these results, we used the coarsest definition of lexical category possible — just the first letter of the corpus tag (i.e. nouns were not tagged separately as singular, plural, etc.). Given the high frequency of lexical category ambiguity, a separate decision making process makes computational sense, if it can achieve sufficient accuracy. If category ambiguities are resolved prior to parsing, the time required by the parser is reduced (Charniak et al, 1996).
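The type- and token-ambiguity figures above are straightforward to estimate from any tagged corpus. The following is a minimal sketch, assuming Python with NLTK's tagged Brown corpus as a convenient stand-in for the TreeBank version used here; because the tagset and tokenisation differ, the percentages it prints will not exactly reproduce the figures in the text.

```python
# Minimal sketch: estimate lexical category ambiguity by type and by token,
# collapsing each tag to its first letter (the coarsest category distinction
# described in the text).  NLTK's tagged Brown corpus is an assumption here,
# used as a stand-in for the TreeBank version of the corpus.
from collections import Counter, defaultdict

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

tags_per_type = defaultdict(set)   # word type -> set of coarse categories seen
token_freq = Counter()             # word type -> token frequency

for word, tag in brown.tagged_words():
    w = word.lower()               # treat types case-insensitively (a simplification)
    tags_per_type[w].add(tag[0])   # 'NN', 'NNS', 'NP', ... all collapse to 'N'
    token_freq[w] += 1

ambiguous = [w for w, tags in tags_per_type.items() if len(tags) > 1]
type_rate = len(ambiguous) / len(tags_per_type)
token_rate = sum(token_freq[w] for w in ambiguous) / sum(token_freq.values())

print(f"ambiguous by type:  {type_rate:.1%}")
print(f"ambiguous by token: {token_rate:.1%}")
```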
A Statistical Lexical Category Module

In this section we outline a specific proposal for a Statistical Lexical Category Module (SLCM). The function of the SLCM is to determine the best possible assignment of lexical part-of-speech categories for the words of an input utterance, as they are encountered. The model differs from other theories of sentence processing in that lexical category disambiguation is postulated as a distinct modular process, which occurs prior to syntactic processing but following lexical access. We argued earlier, on both rational and empirical grounds, for a model of human sentence processing that is (at least partially) statistical: such a model appears sensible and has characteristics which may explain some of the behaviour patterns of the HSPM (human sentence processing mechanism). We therefore propose that the SLCM employs a statistically-based disambiguation mechanism, as such a mechanism can operate efficiently (in linear time) and achieve near optimal performance (most words disambiguated correctly; see next section), and we assume such a module would strive for such rational behaviour.

What Statistics?

If we accept that the SLCM is statistical, a central question concerns what statistics condition its decisions. Limitations of the modular architecture we are proposing constrain the choice. The SLCM has no access to structural representations; structurally-based statistics could therefore not be expressed in its representational vocabulary. We will assume that the input to the module is extremely shallow — just a word and a set of candidate grammatical classes. In this case, the module also has no access to low-level representations including morphs, phonemes and graphic symbols; the module may only make use of statistics collated over words or lexical categories, or combinations of the two.

It seems likely that the SLCM collates statistics concerning the frequency of co-occurrence of individual words and lexical categories. One possible model is therefore that the SLCM just picks the most frequent class for each word; for reasons that will become apparent, we will call this the ‘unigram’ approach. The SLCM may also gather statistical information concerning prior context. For example, decisions about the most probable lexical category for a word may also consider the previous word. Alternatively, such decisions may only consider the category assigned to the previous word, or a combination of both the prior word and its category may be used.
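The ‘unigram’ strategy just described — assign each word its most frequent category, ignoring context — can be stated in a few lines. The sketch below is illustrative only: the function names are ours, and NLTK's tagged Brown corpus is again assumed as training data; the SLCM itself is a claim about human processing, not about this particular implementation.

```python
# Sketch of the 'unigram' baseline: tag every word with its most frequent
# coarse category, ignoring the surrounding context entirely.
from collections import Counter, defaultdict

from nltk.corpus import brown   # assumes the Brown corpus is already installed

def train_unigram(tagged_words):
    """Count how often each word type occurs with each coarse category."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag[0]] += 1
    return counts

def unigram_tag(counts, words, fallback="N"):
    """Assign each word its most frequent category; guess the fallback for unknown words."""
    return [counts[w.lower()].most_common(1)[0][0] if counts[w.lower()] else fallback
            for w in words]

counts = train_unigram(brown.tagged_words())
print(unigram_tag(counts, ["he", "saw", "her", "duck"]))
```

In the ‘bigram’ variant discussed next, the category assigned to the previous word also conditions the decision.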
Probability Theory and the SLCM

The problem faced by the SLCM is to incrementally assign the most likely sequence of lexical categories to a given sequence of words as they are encountered. That is, as each word is input to the SLCM, it outputs the most likely category for it. Research in computational linguistics has concentrated on a (non-incremental) version of this problem for a number of years, and a number of successful and accurate ‘part-of-speech taggers’ have been built (e.g. Weischedel et al, 1993; Brill, 1995). While a number of heuristic tagging algorithms have been proposed, the majority of modern taggers are statistically based, relying on distributional information about language (DeRose, 1988; Weischedel et al, 1993; Ratnaparkhi, 1996; see also Charniak, 1997 for discussion). It is this set of taggers that we suggest is most suitable for an initial model of statistical lexical category disambiguation. They provide a straightforward learning algorithm based on prior experience, are comparatively simple, employ a predictive and uniform decision strategy (i.e. they do not make use of arbitrary or ad hoc rules), and can be naturally adapted to assign preferred lexical category tags incrementally.

The SLCM, as with such part-of-speech taggers, is based on a Hidden Markov Model (HMM), and operates by probabilistically selecting the best sequence of category assignments for an input string of words. (See Corley and Crocker (in press) or Corley (1998) for a more thorough exposition of HMM taggers and the model being assumed here; see also Charniak (1993; 1997) for more general and more formal discussion.) Let us briefly consider the problem of tag assignment from the perspective of probability theory. The task of the SLCM is to find the best category sequence (t1 ... tn) for an input sequence of words (w1 ... wn). We assume that the ‘best’ such sequence is the one that is most likely, based on our prior experience. Therefore the SLCM must find the sequence (t1 ... tn) such that P(t1 ... tn, w1 ... wn) is maximised. That is, we want to find the tag sequence that maximises the joint probability of the tag sequence and the word sequence. One practical problem, however, is that determining such a probability directly is difficult, if we wish to do so on the basis of frequencies in a corpus (as in the case of taggers) or in our prior experience (as would be the case for the psychological model). The reason is that we may have seen very few (or quite often no) occurrences of a particular word-tag sequence, and thus probabilities will often be estimated as zero. It is therefore common practice to approximate this probability with another which can be estimated more reliably.

Corley and Crocker (2000) argue that the SLCM approximates this probability using category bigrams, as follows:

$$ P(t_1, \ldots, t_n, w_1, \ldots, w_n) \;\approx\; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) $$

The two terms on the right-hand side of the equation are the two statistics that we hypothesise to dominate lexical category decisions in the SLCM. P(wi|ti) – the unigram or word-category probability – is the probability of a word given a particular tag. (The use of P(w|t) makes the model appear top-down; see Corley (1998, pp. 85-87) for how this apparently generative statistical model is actually derived from an equation based on bottom-up recognition, and Charniak (1997) for discussion.) P(ti|ti-1) – the bigram or category co-occurrence probability – is the probability that two tags occur next to each other in a sentence. While the most accurate HMM taggers typically use trigrams (Brants, 1999), Corley and Crocker (2000) argue that the bigram model is sufficient to explain existing data and is simpler (it requires fewer statistical parameters). It is therefore to be preferred as a cognitive model, until evidence warrants a more complex model. Estimates for both of these terms are typically based on the frequencies obtained from a relatively small training corpus in which words appear with their correct tags.

This equation can be applied incrementally. That is, after perceiving each word we may calculate a contingent probability for each tag path terminating at that word; an initial decision may be made as soon as the word is seen. Figure 1 depicts tagging of the phrase “that old man”. Each of the words has two possible lexical categories, meaning that there are eight tag paths. In the diagram, the most probable tag path is shown by the sequence of solid arcs; other potential tags are represented by dotted arcs. The tagger’s job is to find this preferred tag path. The probability of a sentence beginning with the start symbol is 1.0. When ‘that’ is encountered, the tagger must determine the likelihood of each reading for this word when it occurs sentence initially. This results in probabilities for two tag paths – start followed by a sentence complementiser and start followed by a determiner. The calculation of each of these paths is shown in Table 1.

[Figure 1: Tagging the sequence “that old man”, with two candidate categories per word (e.g. s-comp vs. determiner, adjective vs. noun, noun vs. verb) and the preferred tag path highlighted.]
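To make the incremental computation concrete, the sketch below walks “that old man” through the bigram model. Every probability here is invented for illustration (the SLCM would estimate them from prior linguistic experience), and the per-category bookkeeping is a standard Viterbi-style shortcut rather than the explicit enumeration of all eight tag paths described for Figure 1.

```python
# Illustrative incremental bigram disambiguation of "that old man".
# All probabilities below are invented for illustration only; the model itself
# estimates P(word|tag) and P(tag|previous tag) from corpus frequencies.

# P(word | tag): each word is two-ways ambiguous, as in Figure 1.
lex = {
    "that": {"s-comp": 0.4,  "det": 0.2},
    "old":  {"adj": 0.3,     "noun": 0.001},
    "man":  {"noun": 0.05,   "verb": 0.002},
}

# P(tag | previous tag); 'start' is the sentence-initial pseudo-category.
bigram = {
    ("start", "s-comp"): 0.1, ("start", "det"): 0.3,
    ("s-comp", "adj"): 0.1,   ("s-comp", "noun"): 0.2,
    ("det", "adj"): 0.3,      ("det", "noun"): 0.5,
    ("adj", "noun"): 0.6,     ("adj", "verb"): 0.01,
    ("noun", "noun"): 0.1,    ("noun", "verb"): 0.3,
}

def step(paths, word):
    """Extend each surviving path by one word, keeping the best path per category."""
    best = {}
    for tag, p_word in lex[word].items():
        candidates = [(p * bigram.get((prev, tag), 0.0) * p_word, path + [tag])
                      for prev, (p, path) in paths.items()]
        best[tag] = max(candidates)
    return best

paths = {"start": (1.0, [])}          # probability 1.0 before any word is seen
for word in ["that", "old", "man"]:
    paths = step(paths, word)
    _, (p, path) = max(paths.items(), key=lambda kv: kv[1][0])
    print(f"after '{word}': preferred path {path}  (p = {p:.2e})")
```

With these invented numbers the determiner–adjective–noun path ends up preferred; with different estimates the outcome would differ, which is exactly what makes the model's preferences a function of word-category and category-bigram frequencies.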



Publication date: 2000